In the last lesson, you saw some basic visualization code for exploring data. That lesson focused on the data, with visualization serving as a tool to explore it. In this lesson, we focus on visualization itself and cover some basic techniques for making your figures more appealing. We also cover how to plot charts on maps.

Chart Components

In this section, we will cover how to configure various components of charts in ggplot2. We are going to use custdata again.

custdata <- read.table('custdata.tsv', header = TRUE, sep = '\t')

Title

The chart title is set with the ggtitle() function.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
g1 = ggplot(data=custdata, aes(x=age,y=income)) + geom_point(color="blue") + ggtitle("Age vs Income")
g1

Axis

As you can see above, R assigns default x and y labels based on the variable names. You can override them with the labs() function.

g2 = g1 + labs(x="Customer Age", y="Annual Income")
g2

Limit an axis to a range

Sometimes you may want to limit an axis to a specific range. There are three ways to achieve this. The first two discard the points outside the range, while the third only changes the visible area.

#Method 1. use xlim. This is shorthand for setting scale limits, so points outside the range are removed
g3 = g2 + xlim(c(0,100))
g3
## Warning: Removed 8 rows containing missing values (geom_point).

#Method 2. use scale_x_continuous. This method will remove all points outside the range
g4 = g2 + scale_x_continuous(limits = c(0,100))
g4
## Warning: Removed 8 rows containing missing values (geom_point).

#Method 3. use coord_cartesian. This method keeps all points and only adjusts the display area
g5 = g2 + coord_cartesian(xlim=c(0,100))
g5
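
The difference between scale limits and coordinate limits is easier to see by counting the points that survive each approach. The following is a minimal sketch on synthetic data (the data frame demo is an assumption standing in for custdata); it uses layer_data() from ggplot2 to inspect what each plot would draw.

```r
library(ggplot2)

# Synthetic stand-in for custdata (assumption): 95 plausible ages
# plus 5 implausible ones above 100.
set.seed(1)
demo <- data.frame(
  age    = c(runif(95, min = 18, max = 90), 120, 130, 140, 150, 160),
  income = rnorm(100, mean = 50000, sd = 10000)
)

p <- ggplot(demo, aes(x = age, y = income)) + geom_point()

# Scale limits: ages outside [0, 100] are treated as missing.
p_scale <- p + scale_x_continuous(limits = c(0, 100))

# Coordinate limits: every row is kept; only the visible window changes.
p_coord <- p + coord_cartesian(xlim = c(0, 100))

# Count the usable x values in the data computed for each plot.
n_scale <- sum(!is.na(layer_data(p_scale)$x))
n_coord <- sum(!is.na(layer_data(p_coord)$x))
c(scale_limits = n_scale, coord_limits = n_coord)
```

Under these assumptions the scale-limited plot should keep 95 points and the coordinate-limited plot all 100. This matters most when a layer computes a statistic (a smoother or a box plot, for example), because discarded points no longer contribute to it.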

Use a function to alter labels

In the last lesson, you saw how to add a dollar sign to income labels. A more advanced technique is to pass a function that formats the labels however you want. The following example shows how to do that.

g6<-ggplot(custdata) + geom_bar(aes(x=health.ins)) 
g7<-g6+scale_x_discrete(labels = function(x) ifelse(x, "Has Insurance", "Without Insurance"))
g7
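
The same idea works for continuous axes. Below is a small sketch, assuming a hypothetical helper called dollars_k and a made-up three-row data frame; the function receives the axis breaks and returns the text to display.

```r
library(ggplot2)

# Hypothetical helper: show values in thousands of dollars,
# for example 50000 becomes "$50k".
dollars_k <- function(v) paste0("$", v / 1000, "k")

# Made-up data for illustration only.
demo <- data.frame(age = c(25, 40, 60), income = c(30000, 50000, 80000))

p <- ggplot(demo, aes(x = age, y = income)) +
  geom_point() +
  scale_y_continuous(labels = dollars_k)  # format each break with the helper
p
```

Because the helper is an ordinary R function, it can be checked on its own before it is attached to a scale.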

Colour points by categorical variable

We may colour the data points on a chart by a categorical variable. For example, we may want to see whether gender makes a difference in the relationship between income and age. This can be achieved by:

g8 <-  ggplot(data=custdata, aes(x=age,y=income,color=factor(sex))) + geom_point()
g8

You may manually set the colours.

g8_1<-g8 + scale_color_manual(values=c("yellow","blue"))
g8_1
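
When colours are given as an unnamed vector, they are assigned in the order of the factor levels. A named vector makes the assignment explicit. The sketch below assumes the sex column uses the codes "F" and "M" (an assumption; check levels(factor(custdata$sex)) on the real data) and runs on a made-up data frame.

```r
library(ggplot2)

# Made-up data; the codes "F" and "M" are assumed.
demo <- data.frame(
  age    = c(25, 32, 47, 51),
  income = c(30000, 42000, 58000, 61000),
  sex    = c("F", "M", "F", "M")
)

# Names tie each colour to a level, independent of level order.
sex_colours <- c(F = "orange", M = "blue")

p <- ggplot(demo, aes(x = age, y = income, color = sex)) +
  geom_point() +
  scale_color_manual(values = sex_colours)

# Colours actually assigned to the points.
used <- unique(layer_data(p)$colour)
used
```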

Working with theme

So far, we have covered the basic components of R charts. We can further configure the look of these charts with the theme() function, which allows us to modify the theme settings for every part of a chart. We cover some of the most common scenarios in this section. You are encouraged to explore more on your own.

Title theme

A title is basically text, so configuring it comes down to setting the right arguments in element_text(). Below are some examples.

g9<-g1 + theme(title=element_text(size=20,face="bold",color="green"))
g9

As you can see, this setting affects every title, including the axis titles. To change only the plot title, use plot.title instead:

g10 <- g1 + theme(plot.title=element_text(size=20,face="bold",color="green"))
g10
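
Different titles can also be styled independently in a single theme() call. The following sketch, on a made-up data frame, centres the plot title with hjust and gives the axis titles their own style; the specific sizes and faces are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up data for illustration only.
demo <- data.frame(age = c(25, 40, 60), income = c(30000, 50000, 80000))

p <- ggplot(demo, aes(x = age, y = income)) +
  geom_point(color = "blue") +
  ggtitle("Age vs Income") +
  theme(
    # hjust = 0.5 centres the plot title horizontally.
    plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
    # axis.title applies to both the x and y axis titles.
    axis.title = element_text(size = 12, face = "italic")
  )
p
```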

Tick text theme

We can also change the theme of tick text.

g11 <- g7 + theme(axis.text.x=element_text(angle=50,size=10,vjust = 0.5))
g11

Background colour

We can also change the background colour of a chart.

#this will change the background colour of the panel, i.e. the area where the data are drawn
g12 <- g1 + theme(panel.background = element_rect(fill = "yellow"))
g12

#this will change the background colour of the whole plot, including the area around the panel
g13 <- g1 + theme(plot.background = element_rect(fill = "yellow"))
g13

Grid line theme

Grid lines can be configured using panel.grid.* series.

g14 <- g1 + theme(panel.grid.major = element_line(color = "yellow", size = 2), panel.grid.minor=element_line(color = "blue"))
g14
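
Grid lines can also be removed entirely by setting the element to element_blank(). The sketch below, on a made-up data frame, hides the minor grid lines and draws the major ones as dashed grey lines; the colour and line type are arbitrary choices.

```r
library(ggplot2)

# Made-up data for illustration only.
demo <- data.frame(age = c(25, 40, 60), income = c(30000, 50000, 80000))

p <- ggplot(demo, aes(x = age, y = income)) +
  geom_point(color = "blue") +
  theme(
    panel.grid.minor = element_blank(),  # draw nothing for minor grid lines
    panel.grid.major = element_line(color = "grey80", linetype = "dashed")
  )
p
```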

Multi-panel plots

Sometimes it is more visually effective to put several panels side by side for comparison.

g15 <-  ggplot(data=custdata, aes(x=age,y=income)) + geom_point() + facet_wrap(~sex,ncol = 1)
g15

You may try other layout functions, such as facet_grid.
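
As a sketch of facet_grid, the example below uses the built-in mtcars data set rather than custdata. It lays the panels out as a grid, with one row per value of am and one column per value of cyl.

```r
library(ggplot2)

# rows ~ columns: transmission type (am) by number of cylinders (cyl).
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)
p

# Number of panels that contain data.
n_panels <- length(unique(layer_data(p)$PANEL))
n_panels
```

Unlike facet_wrap, which wraps a single sequence of panels, facet_grid always produces a full rows-by-columns layout, so it suits comparisons across two categorical variables.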

Please refer to Beautiful plotting in R: A ggplot2 cheatsheet for more information.

Exercise

Use the LearningANTS data to produce effective visualizations.

Other Packages

There are other R packages that you may consider for data visualization. One we recommend is lattice. In this section, we show some examples of how to use it. We are going to use “Lasagna Triers.csv”, which stores customer profile data for a lasagna trial.

colclasses = c("integer","integer","numeric","numeric","factor","numeric","numeric","factor","factor","factor","integer","factor","factor")
triers <- read.csv("Lasagna Triers.csv",header = TRUE, colClasses = colclasses)
str(triers)
## 'data.frame':    856 obs. of  13 variables:
##  $ Person   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age      : int  48 33 51 56 28 51 44 29 28 29 ...
##  $ Weight   : num  175 202 188 244 218 173 182 189 200 209 ...
##  $ Income   : num  65500 29100 32200 19000 81400 73000 66400 46200 61100 9800 ...
##  $ PayType  : Factor w/ 2 levels "Hourly","Salaried": 1 1 2 1 2 2 2 2 2 2 ...
##  $ CarValue : num  2190 2110 5140 700 26620 ...
##  $ CCDebt   : num  3510 740 910 1620 600 950 3500 2860 3180 1270 ...
##  $ Gender   : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1 1 2 2 1 ...
##  $ LiveAlone: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ DwellType: Factor w/ 3 levels "Apt","Condo",..: 3 2 2 3 1 2 2 2 2 1 ...
##  $ MallTrips: int  7 4 1 3 3 2 6 5 10 7 ...
##  $ Nbhd     : Factor w/ 3 levels "East","South",..: 1 1 1 3 3 1 3 3 3 1 ...
##  $ HaveTried: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 2 2 2 ...

Histogram

lattice makes it easy to plot a histogram. For example,

library(lattice)
histogram(~Income, data=triers)

It also allows you to condition histograms on the value of a categorical variable. For example,

#Compare income between genders
histogram(~Income | Gender, data=triers)

#Compare income among neighborhoods. It is more effective to show the comparison in this layout
histogram(~Income | Nbhd, data=triers, layout=c(1,3))
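
lattice functions return a plot object instead of drawing immediately. At the console the object is printed automatically, but inside a function, a loop, or a sourced script you need to call print() yourself. The sketch below uses the built-in mtcars data set as a stand-in for the triers data.

```r
library(lattice)

# One panel per number of cylinders, stacked in a single column.
h <- histogram(~ mpg | factor(cyl), data = mtcars, layout = c(1, 3))

# The result is an object of class "trellis"; nothing is drawn yet.
class(h)

# Draw the plot explicitly.
print(h)
```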

Density plot

As with histograms, you can create density plots easily.

#Compare car value between genders
densityplot(~CarValue | Gender, data = triers, layout=c(1,2), col="black")
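
Conditioning with | puts each group in its own panel. To overlay the groups in a single panel instead, lattice offers the groups argument. The sketch below uses the built-in mtcars data set as a stand-in for the triers data.

```r
library(lattice)

# One density curve per number of cylinders, all in the same panel.
d <- densityplot(~ mpg, groups = cyl, data = mtcars,
                 auto.key = TRUE,      # add a legend for the groups
                 plot.points = FALSE)  # hide the individual observations
print(d)
```

Overlaying makes it easier to compare the shapes of the distributions directly, while separate panels are clearer when there are many groups.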

Dot plot

This is an example using dot plot.

dotplot(~CarValue | Nbhd, data = triers, layout=c(1,3))

Advanced scatter plot

With lattice, it is very convenient to create scatter plots conditioned on a third, categorical variable.

xyplot(Income~CarValue | Gender, data = triers, layout=c(1,2))

Box plot

We can build conditional box plots using the bwplot function in lattice.

bwplot(Weight~factor(Gender) | factor(Nbhd), data = triers, xlab = "Gender")

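Level plot

A level plot displays a matrix as a grid of coloured cells. Here we bin Weight and CarValue into 10 intervals each with cut, compute the mean income for every combination of bins with tapply, and plot the resulting matrix with levelplot.

t1 <- tapply(triers$Income, INDEX =list(cut(triers$Weight,breaks=10), cut(triers$CarValue,breaks=10)), FUN=mean,na.rm =TRUE)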
t1
##           (96.3,3.5e+03] (3.5e+03,6.88e+03] (6.88e+03,1.03e+04]
## (142,154]       34790.00           49030.00            63212.50
## (154,165]       31480.56           41759.26            53718.18
## (165,177]       34464.29           45681.25            60941.67
## (177,188]       28750.00           46593.94            56530.00
## (188,200]       31384.21           46917.39            73018.18
## (200,212]       31759.09           53064.52            45377.78
## (212,223]       30241.46           54520.00            78400.00
## (223,235]       29028.57           48550.00            45633.33
## (235,246]       39308.33           36484.62            92237.50
## (246,258]       23766.67           30475.00            36175.00
##           (1.03e+04,1.36e+04] (1.36e+04,1.7e+04] (1.7e+04,2.04e+04]
## (142,154]            75433.33           44100.00              65600
## (154,165]            58083.33           65366.67              75300
## (165,177]            65255.56           67566.67              93000
## (177,188]            56062.50           52950.00              66600
## (188,200]            67383.33           60366.67              66600
## (200,212]            61520.00           56150.00             100600
## (212,223]            51916.67           64650.00              87750
## (223,235]            59200.00           41200.00                 NA
## (235,246]            47500.00           63750.00              71800
## (246,258]            74500.00                 NA             147700
##           (2.04e+04,2.37e+04] (2.37e+04,2.71e+04] (2.71e+04,3.05e+04]
## (142,154]                  NA                  NA                  NA
## (154,165]                  NA           118900.00               92600
## (165,177]               67050            79350.00                  NA
## (177,188]               56800            60083.33                  NA
## (188,200]               54350                  NA               78925
## (200,212]                  NA                  NA                  NA
## (212,223]               48500            82200.00               48000
## (223,235]                  NA                  NA                  NA
## (235,246]               69900            42400.00                  NA
## (246,258]                  NA                  NA                  NA
##           (3.05e+04,3.39e+04]
## (142,154]                  NA
## (154,165]                  NA
## (165,177]               90900
## (177,188]                  NA
## (188,200]                  NA
## (200,212]                  NA
## (212,223]                  NA
## (223,235]               44400
## (235,246]                  NA
## (246,258]                  NA
levelplot(t1)

levelplot(t1, scales=list(x=list(rot=90)))

t2 <- tapply(triers$Income, INDEX =list(triers$Gender, cut(triers$CarValue,breaks=10)), FUN=mean,na.rm =TRUE)
levelplot(t2, scales=list(x=list(rot=90)))
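
The cut and tapply combination used above is easier to follow on a tiny example. The sketch below uses a made-up six-row data frame (all values are assumptions chosen so the means are easy to verify by hand) and two bins per variable.

```r
library(lattice)

# Made-up data: three light and three heavy people,
# each with a cheap or an expensive car.
demo <- data.frame(
  weight = c(150, 160, 200, 210, 155, 205),
  car    = c(1000, 9000, 1000, 9000, 9000, 1000),
  income = c(30, 50, 40, 60, 70, 20)
)

# cut() turns each numeric column into a factor of equal-width bins;
# tapply() then averages income within each (weight bin, car bin) cell.
m <- tapply(demo$income,
            INDEX = list(cut(demo$weight, breaks = 2),
                         cut(demo$car, breaks = 2)),
            FUN = mean)
m

# Each cell of the matrix becomes one coloured tile.
print(levelplot(m))
```

Cells with no observations come out as NA, which is why the larger table above contains many NA entries in the sparsely populated bins.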

Exercise

Use the three data sets in Chapter 2 of "Data Mining and Business Analytics with R" to do data visualization and develop insights.

Visualizing spatial data

“ggmap” is a package developed on top of ggplot2 for visualizing spatial data. It situates contextual information of various kinds of static maps in the ggplot2 plotting framework. The result is an easy, consistent way of specifying plots which are readily interpretable by both expert and audience and safeguarded from graphical inconsistencies by the layered grammar of graphics framework.

Concept of ggmap

One advantage of making the plots with ggplot2 is the layered grammar of graphics on which ggplot2 is based. By definition, the layered grammar demands that every plot consist of five components:

  • a default dataset with aesthetic mappings,
  • one or more layers, each with a geometric object (“geom”), a statistical transformation (“stat”), and a dataset with aesthetic mappings (possibly defaulted),
  • a scale for each aesthetic mapping (which can be automatically generated),
  • a coordinate system, and
  • a facet specification.

Since ggplot2 is an implementation of the layered grammar of graphics, every plot made with ggplot2 has each of the above elements. Consequently, ggmap plots also have these elements, but certain elements are fixed to map components: the x aesthetic is fixed to longitude, the y aesthetic is fixed to latitude, and the coordinate system is fixed to the Mercator projection.
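
To make the five components concrete, the sketch below spells each of them out for an ordinary ggplot2 scatter plot of the built-in mtcars data. Normally most of these are filled in by defaults; writing them explicitly is only for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +        # default dataset and aesthetic mappings
  layer(geom = "point", stat = "identity",         # one layer: geom + stat
        position = "identity") +
  scale_x_continuous(name = "Weight (1000 lbs)") + # a scale for the x aesthetic
  coord_cartesian() +                              # coordinate system
  facet_null()                                     # facet specification (a single panel)
p
```

In a ggmap plot the same slots exist, but x, y and the coordinate system are already decided by the map.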

A basic framework is to get the map and then overlay it with other ggplot2 charts. The following example illustrates the idea.

library(ggmap)
## Warning: package 'ggmap' was built under R version 3.3.2
#you can get lon and lat of a location by zip code
#pizzahut.location$Location <- paste("Singapore", pizzahut.location$Zipcode, sep = " ")

pizzahut.location <- read.csv("PizzaHut.csv",header = TRUE, colClasses = c("character","character","factor","character","numeric","numeric"))

#Define the map and the base_layer, which is equivalent to ggplot in previous sections
m1 <- qmap("Singapore", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
#Now plot the points, and colour the points based on regions
m2<-m1 + geom_point(aes(color=Region))
m2
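
The output above was produced with an older ggmap release. Google's map services now require an API key, and recent ggmap versions provide register_google() to supply one. The following is a sketch only: the environment variable name GOOGLE_MAPS_API_KEY is a hypothetical choice, and the map request is skipped when no key is available.

```r
# Assumption: the key is stored in an environment variable
# with this hypothetical name.
key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

if (nzchar(key) && requireNamespace("ggmap", quietly = TRUE)) {
  ggmap::register_google(key = key)          # register the key for this session
  sg <- ggmap::qmap("Singapore", zoom = 11)  # same map request as above
  print(sg)
} else {
  message("No API key found or ggmap not installed; skipping the map request.")
}
```

Keeping the key in an environment variable rather than in the script avoids committing it to version control.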

You can configure how the points are displayed just as you would in normal ggplot2 charts. For example, to scale the size of the points by a third variable, you only need to add the mapping in the geom_point function.

pizzahut.location$Visits = round(rnorm(nrow(pizzahut.location),15000,5000))
m1 <- qmap("Singapore", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
m3 <- m1 + geom_point(aes(color=Region, size=Visits))
m3

You can overlay the map with other chart types as well. For example,

pizzahut.location$Visits = round(rnorm(nrow(pizzahut.location),15000,5000))
m1 <- qmap("Singapore", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
m4 <- m1 + geom_point(aes(color=Region)) + geom_path()
m4

pizzahut.location$Visits = round(rnorm(nrow(pizzahut.location),15000,5000))
m1 <- qmap("Singapore", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
m5 <- m1 + stat_bin2d(aes(color=Region,fill=Region)) 
m5

Please refer to ggmap: Spatial Visualization with ggplot2 for more details.

Different types of maps

ggmap can also work with maps from other sources, and each source offers several map types.

Sources of maps:

  • Google Maps - “google”
  • OpenStreetMap - “osm”
  • Stamen Maps - “stamen”
  • CloudMade maps - “cloudmade”

Map types:

  • Google map - “terrain”,“terrain-background”, “satellite”, “roadmap”,“hybrid”
  • Stamen maps - “terrain”, “watercolor”, “toner”
  • cloudmade maps - a positive integer, see ?get_cloudmademap

Let’s try a few combinations here:

pizzahut.location$Visits = round(rnorm(nrow(pizzahut.location),15000,5000))
m1 <- qmap("Singapore", maptype="satellite", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
m6 <- m1 + geom_point(aes(color=Region)) 
m6

Polygons shaping

Plotting charts on a map is useful for many kinds of spatial analytics. However, a more appealing visualization is to shade the areas of a map according to different attributes. The packages “raster” and “rgdal” can be used for this purpose.

“CO2Emission.R” is an example to be used.
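
The idea of shading areas by an attribute can be sketched with plain ggplot2, without the packages named above. This is not the content of “CO2Emission.R”; it is a simplified stand-in in which two hypothetical square regions and their values are made up. In practice the polygon outlines would come from a boundary file such as a shapefile.

```r
library(ggplot2)

# Two hypothetical square regions, one row per polygon vertex.
regions <- data.frame(
  region = rep(c("A", "B"), each = 4),
  lon    = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  value  = rep(c(10, 25), each = 4)  # made-up attribute to shade by
)

p <- ggplot(regions, aes(x = lon, y = lat, group = region, fill = value)) +
  geom_polygon(color = "white") +  # one filled polygon per region
  coord_quickmap()                 # approximate map aspect ratio
p

# Number of distinct fill colours, one per region value.
n_fills <- length(unique(layer_data(p)$fill))
n_fills
```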

Exercise

Obtain spatial data and other government data from data.gov.sg and develop a visualization on a map of Singapore.